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Abstract 

Current analysis of astronomical data are confronted with the daunting task of modeling the awkward 
features of astronomical data, among which heteroscedastic (point-dependent) errors, intrinsic scatter, 
non-ignorable data collection (selection effects), data structure, non-uniform populations (often called 
Malmquist bias), non-Gaussian data, and upper/lower limits. This chapter shows, by examples, how 
modeling all these features using Bayesian methods. In short, one just need to formalize, using maths, 
the logical link between the involved quantities, how the data arize and what we already known on the 
quantities we want to study. The posterior probability distribution summarizes what we known on the 
studied quantities after the data, and we should not be afraid about their actual numerical computation, 
because it is left to (special) Monte Carlo programs such as JAGS. As examples, we show how to predict 
the mass of a new object disposing of a calibrating sample, how to constraint cosmological parameters 
from supernovae data and how to check if the fitted data are in tension with the adopted fitting model. 
Examples are given with their coding. These examples can be easily used as template for completely 
different analysis, on totally unrelated astronomical objects, requiring to model the same awkward data 
features. 

Keywords: Astrophysics; Cosmology; Bayesian statistics; Regression; Scaling relations; Prediction; 
Model testing 

1 Introduction 

Astronomical data present a number of quite common awkward features (see HI for a review): 

• heteroscedastic errors: error sizes vary from point to point. 

• non-Gaussian data: the likelihood is asymmetric and thus eiTors are, e.g 3.4l^ 2. Upper/lower limits, 
as 2.3 at 90 % probability, are (perhaps extreme) examples of asymmetric likelihood. 

• non uniform populations or data structure: the number of objects per unit parameter(s) is non- 
uniform. This is the source of the Malmquist- or Eddington- like bias that affect most astronomical 
quantities, as parallaxes, star and galaxy counts, mass and luminosity, galaxy cluster velocity disper- 
sions, supernovae luminosity corrections, etc. 

• intrinsic scatter: data often scatter more than allowed by the errors. The extra-scatter can be due to 
unidentified sources of errors, often called systematic errors, or indicates an intrinsic spread of the 
population under study, i.e. the fact that astronomical objects are not identically equal. 
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• noisy estimates of the errors: as every measurement, eiTors are known with a finite degree of pre- 
cision. This is even more true when one is measuring complex, and somewhat model dependent, 
quantities like mass. 

• non-random sampling: in simple terms, the objects in the sample are not a random sampling of 
those in the Universe. In some rare occasions in astronomy, sampling is planned to be non-random 
on purpose, but most of the times non-random sampling is due to selection effects: haider-to-observe 
objects are very often missed in samples. 

• mixtures: very often, large samples include the population under interest, but also contaminating 
objects. Mixtures also arise when one measure the flux of a source in presence of a background (i.e. 
always). 

• prior: we often known from past data or from theory that some values of the parameters are more 
likely than other. In order terms, we have prior knowledge about the parameter being investigated. If 
we known anything, not even the order of magnitude, about a parameter, it is difficult even to choose 
which instrument, or sample, should be used to measure the parameter. 

• non-linear: laws of Nature can be more complicated than y = ax + b. 

Bayesian methods allow to deal with these features (and also other ones), even all at the same time, as 
we illustrate in Sec 3 and 4 with two research examples, it is just matter of stating in mathematical terms our 
wordy statements about the nature of the measurement and of the objects being measured. The probabilistic 
(Bayesian) approach returns the whole (posterior) probability distribution of the parameters, very often in 
form of a Monte Carlo sampling of it. 

In this paper we make an attempt to be clear at the cost of being non-rigorous. We defer the reader 
looking for rigour to general textboox, as 0, and, to ||3l for our first research example. 

2 Parameter estimation in Bayesian Inference 

Before adressing a research example, let's consider an ideahzed applied problem to explain the basics of the 
Bayesian approach. 

Suppose one is interested in estimating the (log) mass of a galaxy cluster, IgM. In advance of collecting 
any data, we may have certain beliefs and expectations about the values of IgM. In fact, these thoughts 
are often used in deciding which instrument will be used to gather data and how this instrument may be 
configured. For example, if we plan to measure the mass of a poor cluster via the virial theorem, we will 
select a spectroscopic set up with adequate resolution, in order to avoid that velocity errors are comparable 
to, or lai^ger than, the likely low velocity dispersion of poor clusters. Crystalising these thoughts in the form 
of a probability distribution for IgM provides the prior p{lgM), on which relies the feasibility section of 
the telescope time proposal, where instrument, configuration and exposure time are set. 

For example one may believe (e.g. from the cluster being somewhat poor) that the log of the cluster 
mass is probably not far from 13, plus or minus 1; this might be modeled by saying that the (prior) proba- 
bility distribution of the log mass, here denoted IgM is a Gaussian centred on 13 and with a, the standard 
deviation, equal to 0.5, i.e. IgM ~ M{13, 0.5^). 

Once the appropriate instrument and its set up have been selected, data can be collected. In our exam- 
ple, this means we record a measurement of log mass, say obslgM200, via, for example, a virial theorem 
analysis, i.e. measuring distances and velocities. 

The likelihood describes how the noisy observation obslgM200 arises given a value of IgM. For ex- 
ample, we may find that the measurement technique allows us to measure masses in an unbiased way but 
with a standard error of 0.1 and that the error structure is Gaussian, ie. obslgM200 ~ J\f{lgM, 0.1^), where 



2 



the tilde symbol reads "is drawn from" or "is distributed as". If we observe obslgM200 = 13.3 we usually 
summarise the above by writing IgM = 13.3 ± 0.1. 

How do we update our beliefs about the unobserved log mass IgM in light of the observed measurement, 
obslgM2001 Expressing this probabilistically, what is the posterior distribution of IgM given obslgM200, 
i.e. p{lgM \ obslgM200)l Bayes Theorem (gl, O) tells us that 

p{lgM I obslgM200) a p{obslgM200 \ lgM)p{lgM) 

i.e. the posterior (the left hand side) is equal to the product of likelihood and prior (the right hand side) 
times a proportionaUty constant of no importance in parameter estimation. 

Simple algebra shows that in our example the posterior distribution of IgM \ obslgM200 is Gaussian, 

with mean /_f = ^^'x/o's^ [^1/0^(2'"'^ = 13.29 and = i/g 5:^+1/0 p — 0.0096. /U is just the usual weighted 
average of two "input" values, the prior and the observation, with weights given by prior and observation 
a's. 

From a computational point of view, only simple examples such as the one described above can gen- 
erally be tackled analytically, realistic analysis should be instead tacked numerically by special (Mai^kov 
Chain) Monte Carlo methods. These are included in BUGS-like programs (O) such as JAGS (Q), al- 
lowing scientists to focus on formalizing in mathematical terms our wordy statements about the quantities 
under investigation without worrying about the numerical implementation. In the idealized example, we just 
need to write in an ascii file the symbolic expression of the prior, IgM ~ J\f{13, 0.5^), and of likelihood, 
obslgM200 ~ M {IgM, O.l'^) to get the posterior in form of samplings. From the Monte Carlo sampling 
one may directly derive mean values, standard deviations, and confidence regions. For example, for a 90 % 
interval it is sufficient to peak up the interval that contain 90 % of the samplings. 

3 First example: predicting mass from a mass proxy 

Mass estimates are one of the holy grails of astronomy. Since these are observationally expensive to measure, 
or even unmeasurable with existing facilities, astronomers use mass proxies, far less expensive to acquire: 
from a(n usually small) sample of objects, the researcher measures masses, y and the mass proxy, x. Then, 
he regress x vs y and infer y for those objects having only x. This is the way most of the times galaxy 
cluster masses are estimated, for example using the X-ray luminosity. X-ray temperature, Yx, Ysz, the 
cluster richness or the total optical luminosity. Here we use the cluster richness, i.e. the number of member 
galaxies, but with minor changes this example can be adapted for other cases. 

ifn shows that predicted y using the Bayesian approach illustrated here ai^e more precise than any other 
method and that the Bayesian approach does not show the large systematics of other approaches. This 
means, in the case of masses, that more precise masses can be derived for the same input data, i.e. at the 
same telescope time cost. 

3.1 Step 1: put in formulae what you know 

3.1.1 Heteroscedasticity 

Clusters have widely different richnesses, and thus widely different errors. Some clusters have better de- 
termined masses than other. Heteroscedasticity means that errors have an index i, because they differ from 
point to point. 

3.1.2 Contamination (mixtures), non-Gaussian data and upper limits 

Galaxies in the cluster direction are both cluster members and galaxies on the line of sight (fore/background). 
The contamination may be estimated by observing a reference line of sight (fore/background), perhaps with 
a Ci times larger solid angle (to improve the quality of the determination). 
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The mathematical translation of our words, when counts are modeled as Poisson, is: 

obsbkgi ^ V{nbkgi) # Poisson with intensity nfe^j (1) 

obstoti ~ V{nbkgi/Ci + n200j) # Poisson with intensity {nbkgi/Ci + n200j) (2) 

The variables ri200i and nbkgi represent the true richness and the true background galaxy counts in the 
studied solid angles, whereas we add a prefix "obs" to indicate the observed values. 

Upper limits are automatically accounted for. Suppose, for exposing simplicity, that we observed 5 
galaxies, obstoti = 5, in the cluster direction and that in the control field direction (with Cj = 1 for 
exposing simplicity) we observe four background galaxies, obsbkg = 4. With one net galaxy and Poisson 
fluctuations of a few, n200j is poorly determined at best, and the conscientious researcher would probably 
reports an upper limits of a few. To use the information contained in the upper limit in our regression 
analysis we only need to list in the data file the raw measurements {Ci = 1, obstoti = 5, obsbkgi = 4), as 
for the other clusters. These data will be treated independently on whether an astronomer decides to report 
a measurement or an upper limit, because Nature doesn't caie about astronomer decisions. 

3.1.3 Non-linearity and extra-scatter 

The relation between mass, Af200, and proxy, n200, (richness) is usually parametrized as a power-law: 

M200i oc n200f 

Allowing for a Gaussian intrinsic scatter a scat (clusters of galaxies of a given richness may not all have 
the very same mass) and taking the log, previous equation becomes: 

lgM200i ~ M{a + /3 log(ri200i), cj^^^J # Gaussian scatter around (M200i oc n200f ) (3) 
where the intercept is a and the slope is /3. 

3.1.4 Noisy errors 

Once logged, mass has Gaussian errors. In formulae: 

obslgM200i J\f{lgM200i,af) # Gauss errors on Ig mass (4) 

However, errors (as everything) is measured with a finite degree of precision. We assume that the 
measured error, obserrlgM200i, is not biased (i.e. it is not systematically larger or smaller that the true 
enw, (Tj) but somewhat noisy. If a distribution is adopted, it satisfies both our request of unbiasness and 
noisiness. In formulae: 

obserrlgM200i ^ afx^/v # Unbiased enws (5) 

where the parameter u regulates the width of the distribution, i.e. how precise measured eiTors are. Since 
we are 95% confident that quoted eiTors are correct up to a factor of 2, 

V =6 # 95 % confident within a factor 2 (6) 
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3.1.5 Prior knowledge and population structure 

The data used in this investigation are of quality good enough to determine all parameters, but one, to a 
sufficient degree of accuracy that we should not care about priors and we can safely take weak (almost 
uniform) priors, zeroed for un-physical values of parameters (to avoid, for example, negative richnesses). 
The exception is given by the prior on the errors (i.e. cji), for which there is only one measurement per 
datum. The adopted prior (eq. 11) is supported by statistical considerations (see Q for details). The same 
prior is also used for the intrinsic scatter term, although any weak prior would return the same result, because 
this term is well determined by the data. 

a~ A^(0.0, 10^) # Almost uniform prior on intercept (7) 

/3 ~ ti # Uniform prior on angle (8) 

n200i U{0,oo) # Uniform, but positive, cluster richness (9) 

nhkgi ^ U{0,oo) # Uniform, but positive, background rate (10) 

1/af r(e,e) #Weak prior on eiTor (11) 

l/fjfc^j ~ r(e,e) # Weak prior on intrinsic scatter (12) 

Richer clusters are rarer. Therefore, the prior on the cluster richness is, for sure, not uniform, contrary to 
our assumption (eq. 9). Modeling the population structure is un-necessary for the data used in HU and here, 
but is essential if noiser richnesses were used. Indeed, Q shows that a previous published richness-mass 
calibration, which uses richnesses as low as obsn200 = 3 and neglects the n200 structure, shows a slope 
biased by five times the quoted uncertainty. Therefore, the population structure cannot be overlooked in 
general. 

3.2 Step 2: remove TeXing, perform stochastic computations and publish 

At this point, we have described, using mathematical symbols, the link between the quantities that matter 
for our problem, and we only need to compute the posterior probability distribution of the parameters by 
some sort of sampling (the readers with exquisite mathematical skills may instead attempt an analytical 
computation). 

Just Another Gibb Sampler (JAGSQ, (Q) can return it at the minor cost of de-TeXing equations 1 to 
12 (compare them to the JAGS code below). Poisson, Normal and Uniform distributions become dpois, 
dnorm, dunif , respectively. JAGS, following BUGS (O), uses precisions, prec = 1/cr^, in place of 
variances cr^. Furthermore, it uses neperian logarithms, instead of decimal ones. Eq. 5 has been rewritten 
using the property that the is a particular form of the Gamma distribution. Eq. 3 is split in two JAGS 
lines for a better reading. The arrow symbol reads "take the value of. obsvarlgM2 is the square of 
obserrlgM200. For computational advantages, log(n200) is centred at an average value of 1.5 and a is 
centred at -14.5. Finally, we replaced infinity with a large number. 

The model above, when inserted in JAGS, reads: 

model 
{ 

for (i in 1 : length (obstot ) ) { 

obsbkg[i] " dpois (nbkg [ i ] ) # eq 1 

obstot[i] dpois (nbkg [i] /C [i] +n200 [i] ) # eq 2 

n200[i] ' dunif (0, 3000) # eq 9 

nbkg[i] ~ dunif ( , 30 00 ) # eq 10 



http://calvin.iarc.fr/'~martyn/software/jags/ 
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Figure 1: Richness-mass scaling. The solid line marks the mean fitted regression line, while the dashed 
line shows this mean plus or minus the intrinsic scatter a scat- The shaded region marks the 68% highest 
posterior credible interval for the regression. Error bars on the data points represent observed errors for both 
variables. The distances between the data and the regression line is due in part to the measurement eiTor and 
in part to the intrinsic scatter. From ||3l, reproduced with permission. 
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JAGS samples the posterior distribution of all quantity of interests, such as intercept, slope and intrinsic 
scatter by Gibb sampling (a sort of Monte Carlo). Form these samplings, it is straightforward to compute 
(posterior) mean and standard deviations (by computing the average and the standard deviation!), to plot 
posterior marginals (by ignoring the values of the other parameters) and confidence contours, data and mean 
model, etc. Therefore, our effort is over, we only need to produce nice plots and summaries of our results. 

Figure 1 shows the data used in this analysis (see IS for details), the mean scaling (solid line) and 
its 68% uncertainty (shaded yellow region) and the mean intrinsic scatter (dashed lines) around the mean 
relation. The ±1 intrinsic scatter band is not expected to contain 68% of the data points, because of the 
presence of measurement errors. 

Figure 2 shows the posterior marginals for the intercept, slope and intrinsic scatter cjgcat- These marginals 
are reasonably well approximated by Gaussians. The intrinsic mass scatter at a given richness, ascat = 
'7/gAf200| iog?t200' is Small, 0.19 ± 0.03. (Unless otherwise stated, results of the statistical computations are 
quoted in the form x it y where x is the posterior mean and y is the posterior standard deviation.) 

The found relation is: 



lgM2m = (0.96 ± 0.15) (logn200 - 1.5) + 14.36 ± 0.04 



(13) 




1 2 -0.4-0.2 0.2 0.2 0.4 



slope /S interc. a intr. scatter cj^^^^^. 

Figure 2: Posterior probability distribution for the parameters of the richness-mass scaling. The black jagged 
histogram shows the posterior as computed by MCMC, marginalised over the other parameters. The red 
curve is a Gauss approximation of it. The shaded (yellow) range shows the 95% highest posterior credible 
interval. From tSJ, reproduced with permission. 



3.3 Predicting masses 

As mentioned, one of the reasons why astronomers regress a quantity x against another one, y, is to predict 
the latter when a direct measurement is missing (usually because observationally expensive to acquire). It 
is clear that the uncertainty on the predicted y, called y hereafter, should account for: a) the intrinsic scatter 
between y and x (a larger scatter implies a lower quality y estimate); b) the uncertainty of the x (the larger 
it is, the noisier will be the y; c) the quality of the calibration between y and x (better determined relations 
should return more precise estimates of y); and d) extrapolation errors, i.e. should penalize attempts to infer 
y values corresponding to x values absent from the calibrator sample (e.g. outside the range sampled by it). 
All these requirements are satisfied using the posterior predictive distribution, 

p{y) = J p{y\e)p{&\y)de (i4) 

where are the regression parameters (intercept, slope, intrinsic scatter). This apparently ugly ex- 
pression is easy to understand: one should combine (multiply) the uncertainties of the calibrating rela- 
tion, p{6\y), to the uncertainty of predicting new data if the calibrating relation were perfectly known, 
p{y\0). Since we are now interested in predicted values only, we get rid of non-interesting parameters {6) 
by marginalization (integration). 

Posterior predictive distributions aie so basic to be introduced at page 8 of the > 700 pages "Bayesian 
Data Analysis" book (IH) and to be offered as a standard output of JAGS. Of course, we need to list the x 
values (richnesses), and errors of the clusters for which we want to infer y (predicted mass) in the data file, 
listing for example in the data file obsbkg = 12, obtot = 32, C = 5, and mass obslgM200 = NA ("not 
available"), to indicate that this quantity should be estimated using the regression computed from the points 
with available masses and galaxy counts. JAGS returns p{y) in form of sampling and therefore, as for any 
other parameter, a point estimate may be obtained by taking the average and a 68 % (credible) interval can 
be derived by taking the interval including 68 % of the samplings. Returned values behave as expected, and 
indeed have large errors when masses are estimated for clusters with richnesses outside the range where the 
cahbration has been derived (ISll) 

4 Second example: Cosmological parameters from SNIa 

Supemovae (SNIa) ai^e very bright objects with very similar luminosities. The luminosity spread can be 
made even smaller by accounting for the correlation with colour and stretch parameter (the latter is a mea- 
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Figure 3: Apparent magnitude vs redshift of the SNIa sample (upper panels), and their residual from a 
Qm = 0.3, Qa = 0.7 cosmological model (bottom panels) before (left panels) or after (right panels) 
correcting for stretching and color parameter. 

surement of how slowly SNIa fade), as illustrated in Figure 3 for the sample in f9\. These features make 
SNIa very useful for cosmology: they can be observed far away and the relation between flux (basically 
the rate of photons received) and luminosity (the rate of photons emitted) is modulated by the luminosity 
distance (to the square), which in turns is function of the cosmological parameters. Therefore, measuring 
SNIa fluxes (and redshift) allows us to put constraints on cosmological parameters. The only minor com- 
plication is that SNIa luminosities are function of their colour and stretch parameter, and these parameters 
have an intrinsic scatter too, which in turns has to be determined from the data at the same time as the other 
parameters. 

ifTOl shows that the Bayesian approach delivers tighter statistical constraints on the cosmological param- 
eters over 90 % of the times, that it reduces the statistical bias typically by a factor ~ 2 — 3 and that it has 
better coverage properties than the usual chi-squared approach. 

In this second example we can proceed a bit faster in illustrating this non-lineai^ regression with het- 
eroscedastic errors, non-uniform data structure and intrinsic scatter. In this example, we also briefly discuss 
the prior sensitivity, i.e. how much the results are affected by the chosen prior, and we also check the quality 
of the model fit. 
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error rank 



Figure 4: Observed values of the stretch parameters, obsxi, and of the colour parameter, obsci ranked by 
error. Points scatter more than the error bars (see the left side of the figure). The dashed lines indicate the 
size of the intrinsic scatter as determined by our analysis. 

4.1 Step 1: put in formulae what you know 

We observe SNIa magnitudes obsnii (= —2.5log{flux) + c) with Gaussian errors am,i, i-e- 

obsirii Af{mi,af-^^) (15) 
These rrii are related to the distance modulus distmodi, via 

rui = M + distmodi — axi + (3 Ci 
with a Gaussian intrinsic scatter a scat- More precisely: 

mi ~ M {M + distmodi — a Xi + fi a, al^^i) (16) 

where M is the (unknown) mean absolute magnitude of SNIa, and a and /3 allow to reduce the SNIa 
luminosity scatter by accounting for the correlation with the stretch and colour parameters. 
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Similarly to lITOl . the M, a, /3 and log a scat priors ai^e taken uniform in a wide range: 



logio CT.cat- Z^(-3,0) (17) 

a~ U{-2,2) (18) 

/3~ W(-4,4) (19) 

M~ W(-20.3, -18.3) (20) 

Xi and Cj are the true value of the stretch and colour parameters, of which we observe (the noisy) obsxi 
and obsci with errors ax i and ac %■ In formulae: 



obsxi ^ A/'(xj,(T^ j) (21) 
o6sc, ~ M{c^,ali) (22) 

The key point of the modeling is that the obsxi and obsci values scatter more than their errors, but not 
immensely so, see Fig 4. The presence of a non-uniform distribution induces a Malmquist-like bias if not 
accounted for (e.g. large obsxi values are more likely low Xi values scattered to large values than vice versa, 
because of the larger abundance of low Xi values). Therefore, we model, as lITOl do, the individual Xi and 
Ci as drawn from independent normal distributions centered on xm and cm with standard deviation Rx and 
i?c respectively. In formulae: 



~ M{xm,Rl) (23) 
a ~ A/'(cm, Rl) (24) 

We take uniform priors for xm and sc, and uniform priors on log Rx and on log Rc, between the indi- 
cated boundaries: 



2;m~ U{-W,+10) (25) 

cm~ U{-3,+3) (26) 

logioi?x~ Ui-5,+2) (27) 

logioiic- U{-5,+2) (28) 



That's almost all: we need to remember the definition of distance modulus: 



distmodi = 25 + 5 logio (29) 

where the luminosity distance, dl is a complicate expression, involving integrals, of the redshift Zi 
and the cosmological parameters J^a, ^m,w, Hq (see any recent cosmology textbook for the mathematical 
expression). 

Redshift, in the considered sample, have heteroscedastic Gaussian errors az i'. 



obszi^ M{zi,ali) (30) 

The redshift prior is taken uniform 
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Z^(0,2) 



(31) 



Supemovae alone do not allow to determine all cosmological parameters, so we need external prior on 
them, notably on Hq, taken from ifTTl to be 



Fo~ AA(72,82) (32) 

At this point, we may decide which sets of cosmological models we want to investigate using SNIa, for 
example a flat universe with a possible w ^ with the following priors: 



Qm ~ K{0, 1) (33) 
nk= (34) 
Z^(-4,0) (35) 



or a curved Universes with w 



or any other set. 

Both considered cosmologies have: 



Qm ~ ^(0, 1) (36) 
Ui-1,0) (37) 
w= -1 (38) 



nk= i-nm-n^ (39) 

Finally, one may want to use some data. As shortly mentioned, we use the compilation of 288 SNIa in 

n. 

4.2 Step 2: remove TeXing, perform stochastic computations and publish 

Most of the distributions above are Normal, and the posterior distribution can be almost completely analyt- 
ically computed (" ifTOl ). However, numerical evaluation of the stochastic part of the model on an (obsolete) 
laptop takes about one minute, therefore there is no need for speed up. Instead, the evaluation of the lu- 
minosity distance is CPU intensive (it takes 10'^ more times, unless approximate analytic formulae for 
the luminosity distance ai^e used), because an integral has to be evaluated a number of times equal to the 
number of supernovae times the number of target posterior samplings, i.e. about four millions times in our 
numerical computation. The JAGS implementation of the luminosity distance integral is implemented as a 
sum over a tightly packed grid on redshift. 

As the previous example, eq 15 to 32 can be de-TeXed and used in JAGS, adding one of the two set of 
priors, 33-35 or 36-38, depending on which problem one is interested in. 

data { 

# JAGS like precisions 
precmag <-l/errmag/errmag 
precobsc <- 1/errobsc/errobsc 
precobsx <- 1/errobsx/errobsx 
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precz <- 1/errz/errz 

# grid for distance modulus integral evaluation 
for (k in 1 : 1500) { 

grid.z[k] <- (k-0 . 5 ) /1 00 . 

} 

step. grid. z <-grid. z [2] -grid. z [1] 

} 

model { 

for (i in 1 : length (obsz ) ) { 

obsm[i] ~ dnorm (m [ i ] , precmag [ i ] ) # eq 15 

m[i] ~ dnorm (Mm+distmod [ i ] - alpha* x[i] + beta*c[i], precM) # eq 16 
obsc[i] ~ dnorm(c[i], precobsc[i]) # eq 22 

c[i] ~ dnorm (cm, precC) # eq 24 

obsx[i] ~ dnorm(x[i], precobsx[i]) # eq 21 

x[i] ~ dnorm(xm, precx) # eq 23 

# distmod definition & HO term 

distmod[i] <- 25 + 5/2.3026 * log(dl[i]) -5/2.3026* log (H0/300000) 
z [i] - dunif (0, 2) # eq 31 

obsz[i] ~ dnorm ( z [ i ], precz [ i ] ) # eq 30 

######### dl computation (slow and tedious) 

tmp2[i] <- sum ( step ( z [ i ] -grid . z ) * (1+w) / (1+grid.z)) * step. grid. z 
omegade[i] <- omegal * exp(3 * tmp2[i]) 

xx[i] <- sum (pow ( ( 1+gr id . z ) " 3*omegam + omegade[i] + ( 1+grid . z ) " 2 *omegak, -0 . 5 ) * 
*step.grid.z * step ( z [ i ] -grid . z ) ) 

# implementing if, to avoid diving by added le-7 to omegak 

zz[l,i] <- sin (XX [i] *sqrt (abs (omegak) ) ) * ( 1+z [ i ] ) / sqrt (abs (omegak+le-7 ) ) 
zz[2,i] <- xx[i] * (l+z[i]) 

zz[3,i] <- (exp (XX [ i ] *sqrt (abs (omegak) )) -exp (-XX [ i ] *sqrt (abs (omegak) ))) /2 * 

* (1+z [i] ) /sqrt (abs (omegak+le-7) ) 
dl[i] <- zz[b,i] 
} 

b <- 1 + (omegak==0) + 2* (omegak > 0) 
########## end dl computation 



# JAGS uses precisions 

precM <- 1/ intrscatM /intrscatM 
precC <- 1/ intrscatC /intrscatC 
precx <- 1/ intrscatx /intrscatx 

# priors 



Mm~ dunif(-20.3, -18.3) 


# 


eq 


20 


alpha ~ dunif (-2 , 2 . ) 


# 


eq 


18 


beta ~ dunif (-4, 4 . 0) 


# 


eq 


19 


cm " dunif (-3, 3) 


# 


eq 


26 


xm - dunif (-10, 10) 


# 


eq 


25 


# uniform prior on logged quantities 








intrscatM <- pow ( 1 , Igint rscatM) 


# 


eq 


17 


IgintrscatM " dunif (-3,0) 


# 


eq 


17 


intrscatx <- pow ( 1 , Igint rscatx) 


# 


eq 


27 


Igintrscatx " dunif (-5,2) 


# 


eq 


27 


intrscatC <- pow (10, IgintrscatC) 


# 


eq 


28 


IgintrscatC " dunif (-5,2) 


# 


eq 


28 


Icosmo priors 








HO ~ dnorm(72, 1/8 . /8 . ) 


# 


eq 


32 


omegal <-l-omegam-omegak 


# 


eq 


39 


# cosmo priors 1st set LCDM 








#omegam~ dunif (0,1) 


# 


eq 


36 


#omegak~ dunif (-1 , 1 ) 


# 


eq 


37 


#w <- -1 


# 


eq 


38 


# cosmo priors 2nd set: wCDM 








omegam'dunif (0,1) 


# 


eq 


33 



12 




0.1 0.2 0.05 0.1 0.15 0.5 1 



mag intr. scatter ct^^^^^ col intr. scatter R_, stretch intr. scatter 

(a) 

Figure 5: Prior and posterior probability distribution for the three intrinsic scatter terms in the SNIa prob- 
lem. The black jagged histogram shows the posterior as computed by MCMC, marginalised over the other 
parameters. The red curve is a Gauss approximation of it. The shaded (yellow) range shows the 95% highest 
posterior credible interval. The adopted priors are indicated by the blue dotted curve. 



omegak <-0 # eq 34 

w - dunif (-4,0) # eq 35 

} 



Figure 5 shows the prior (dashed blue line) and posterior (histogram) probability distribution for the 
three intrinsic scatter terms present in the cosmological parameter estimation: the scatter in absolute lumi- 
nosity after colour and stretch corrections, (agcat), the intrinsic scatter in the distribution of the colour and 
stretch terms (Rc and Rx)- This plot shows that the posterior probability at intrinsic scatters near zero is 
approximately zero and thus that the three intrinsic scatter terms are necessary parameters for the modeling 
of SNIa, and not useless complications. The three posteriors are dominated by the data, being the prior quite 
flat in the range where the posterior is appreciably not zero (Figure 5). Therefore, any other prior choice, as 
long as smooth and shallow over the shown parameter range, would have returned indistinguishable results. 

Not only SNIa have luminosities that depend on colour and stretch terms, but these in turns have their 
own probability distribution (taken Gaussian for simplicity) with a well determined width. Figure 6 depicts 
the Malmquist- like bias one should incur if the spread of the distribution of colour and stretch parameters 
is ignored: it reports the observed values (as in Fig 4), obsxi and obsci as well as the true values Xi and Cj 
(posterior means). The effect of equations 23 and 24 is to pull values toward the mean, and more so those 
with large errors, to compensate the systematic shift (Malmquist-lrke bias) toward larger observed values. 

Figure 7 shows the probability distribution of the two the colour and stretch slopes: a = 0.12 it 0.02 
and /3 = 2.70 ± 0.14 respectively. As for the intrinsic scatter terms, the posterior is dominated by the data 
and therefore any other prior, smooth and shallow, would have returned indistinguishable results. 

Finally, Fig 8 reports perhaps the most wanted result: contours of equal probability for the cosmological 
parameters Q m and w. 

For one dimensional marginals, we found: Q^m = 0.40 ± 0.10 and w = —1.2 it 0.2, but with non- 
Gaussian probability distributions. 

4.3 Model checking 

The work of the careful researcher does not end by finding the parameter set that best describe the data, (s)he 
also checks whether the adopted model is a good description of the data, or it is misspecified, i.e. in tension 
with the fitted data. In the non-Bayesian paradigm this is often achieved by computing a p-value, i.e. the 
probability to obtain data more discrepant than those in hand once parameters are taken at the best fit value. 
The Bayesian version of the concept (e.g. Q) acknowledges that parameters ai^e not perfectly known, and 
therefore one should also explore, in addition to best fit value, other values of the parameters. Therefore, 
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Figure 6: Effect of the population structure for the stretch and the colour parameters. Each tick goes from 
the observed value to the posterior mean. The population modeling attempt to counterbalance the increased 
spread (Malmquist-like), especially those with larger error (on the right, in the figures), pulling values toward 
the mean. 




Figure 7: Prior and posterior probability distribution for the colour and stretch slopes of the SNIa problem. 
The black jagged histogram shows the posterior as computed by MCMC, marginalised over the other pa- 
rameters. The red curve is a Gauss approximation of it. The shaded (yellow) range shows the 95% highest 
posterior credible interval. The adopted (uniform) priors are indicated by the blue dotted curve. 
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Figure 8: Constraints on the cosmological parameters Om and w. The two contours delimit 68 % and 95 % 
constraints. 



discrepancy becomes a vector, instead of a scalar, of dimension j, that measure the distance between the data 
and j models, one per set of parameters considered. Of course, more probable models should occur more 
frequently in the list to quantify that discrepancy from an unlikely model is less detrimental than discrepancy 
from a likely model. In practice, if parameters aie explored by sampling, it is just matter of computing the 
discrepancy of the data in hand for each set j of parameters stored in the chain, instead of relying on one 
single set of parameter (say, those that maximize the likelihood). Then, one repeats the computation for fake 
data generated from the model, and counts how many times fake data are more extreme of real data. 

For example, if we want to test the modeling of the observed spread of magnitude (i.e. equation 15 and 
16), let's define: 



mcori = M + distmodi — axi + j3 Ci (40) 
We generate fake supemovae mag: 

m.fakei^ M {mcori, a^^^i) (41) 
and fake observed values of them, 

obsm.fakei^ J\f{m.fakei,a'^i) (42) 

Then, we adopt a modified to quantify discrepancy (or its contrary, agreement). For the real data set 
we have: 

2 _ V- {obsmi - mcorjjf 

where summation is over the data and j refers to the index in the sampling chain. 

Apart for the j index, eq. 43 is just the usual the difference between observed, ohsmi, and true 
mcori, values, weighted by the expected variance, computed as quadrature sum of eiTors, am,i, and super- 
novae mag intrinsic scatter Ggcat- The of the fake data set, X/afce j i^' 

2 _ {ohsm.fakejj - mcorijf 
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Figure 9: Standardized residuals (histogram) and their expected distribution (blue curve). 



At this point, we only need to compute for which fraction of the simulations Xjake j > Xreai j ^^'^ quote 
the result. If the modeling is appropriate, then the computed fraction (p-value) is not extreme (far from zero 
or one). If not, our statistically modeling need to be revised, because the data are in disagreement with the 
model. 

We performed 15000 simulation^ each one generating 288 fake measurements of SNIa measurements. 
In practice, we added the following three JAGS lines: 

mcor [i] <-Mm+distmod [i] - alpha* x[i] + beta*c[i] # eq 40 
m.fake[i] ~ dnorm (mcor , precM) # eq 41 

obsm.fake[i] ~ dnorm (m. fake [ i ], precmag [ i ] ) # eq 42 

and we can simplify eq 16 in 



m 1 



dnorm (mcor [ i ] , precM) 



# eq 16 



We found a p-value of 45 %, i.e. that the discrepancy of the data in hand is quite typical (similar- to 
the one of the fake data). Therefore, real data ai^e quite common and the tested part of the model shows 
no evidence of misspecification. The careful researcher should then move to the other parts of the model, 
whose detailed exploration is left as exercise. 

In such exploration of possible model misfits, it is very useful to visually inspect several data summaries 
to guide the choice of which discrepancy measure one should adopt (eq. 43 or something else?), and, if the 
adopted model turns out to be unsatisfactory, to guide how to revise the modeling of the the tested part of 
the model. For example, a possible (and common) data summary is the distribution of normalized residuals, 
that for obsxi reads: 



stdresohsxj 



obsxi — E{xi) 



(45) 



i.e. observed minus expected value of Xi divided by their expected spread (the sum in quadrature of 
errors and intrinsic spread). A similar summary may be built for obsci too. To first order (at least), stan- 
dardized residuals should be normal distributed with standai^d deviation one (by construction). Fig 9 shows 
the distribution of normalized residuals of both obsxi and obsci, with superposed a Gaussian centered in 
with standard deviation equal to one (in blue). Both distributions show a possible hint of asymmetry. At this 
point, the careful researcher may want to use a discrepancy measure sensitive to asymmetries, as the skew- 
ness index, in addition to the during model testing. While leaving the actual computation to the reader, 
we emphasize that if an extreme Bayesian p-value if found (on obsxi for exposing simplicity), then one may 



Skilled readers may note that we are dealing, by large, with Gaussian distributions, and may attempt an analytic computation. 
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replace its modeling (eq. 23 in the case of obsxi) with a distribution allowing a non-zero asymmetry and 
this can be easily performed in a Bayesian approach, and easily implemented in JAGS, it is just matter of 
replacing the adopted Gaussian with an asymmetric distribution. If instead the data exploration gives an hint 
of double-bumped distribution (again on obsxi for exposing simplicity), and an extreme Bayesian p-value 
is found when a measure of discrepancy sensitive to double-bumped distributions is adopted, then one may 
adopt a mixture of Gaussians, replacing (eq. 23 in the case of obsxi) with 

ohsxi ~ \N{xi,al i) + (1 - X)M{xxi,al^ i) (46) 

Even more simple is the (hypothetical) case of possible distribution (again of ohsx-i for exposing sim- 
plicity) with fat tails: one may adopt a Cauchy distribution. In such case, coding in JAGS is it is just matter 
of replacing a dnorm with dt in the JAGS implementation. And so on. 

In summary, model checking consists in updating the model until it produces data similar to those in 
hand. One should start by carefully and attentively inspect the data and their summaries. This inspection 
should suggest a discrepancy measure to be used to quantify the model misfit and, if one is found, to guide 
the model updating. The procedure should be iterated until the model fully reproduces the data. 

5 Summary and conclusions 

These two analysis offer a template for modeling the common awkward features of astronomical data, 
namely heteroscedastic errors, non-Gaussian likelihood (inclusive of upper/lower limits), non-uniform pop- 
ulations or data structure, intrinsic scatter (either due to unidentified source of errors, or due to population 
spreads), noisy estimates of the errors, mixtures and prior knowledge. 

In a Bayesian framework, learning from the data and the prior it is just matter of formalizing in math- 
ematical terms our wordy statements about the quantities under investigation and how the data arize. The 
actual numerical computation of the posterior probability distribution of the parameters is left to (special) 
Monte Carlo programs, of which we don't need to be afraid more than numerical methods (Monte Carlo) 
used to compute the integral of a function. 

The great advantage of the Bayesian modeling is its high flexibility: if the data (or theory) call for a 
more complex modeling, or call for using distributions different from those initially taken, it is just matter 
of replacing them in the model, because there are no simplifications forced by the need of reaching the 
finishing line, as instead is the case in other modeling approaches. Furthermore, if one is interested in 
constraining another cosmological model, for example one with a redshift-dependent dark energy equation 
of state w = wq + wi{1 + z)/{l + Zp), one should just replace w in the JAGS code with the above equation. 
This flexible, hierarchical, modeling is native with Bayesian methods. 

In both our two examples, the Bayesian approach performs, unsurprisingly, better than than non-Bayesian 
methods obliged to discard part of the available information in order to reach the finishing line: Bayesian 
methods just use all the provided information. 

As a final note, we remember that the careful researcher, whether using a Bayesian modeling or not, 
before publishing his own result should check that the numerical computation is adequate for his purpose 
and that the model is appropriate for the data. Therefore, if you, gentle reader, are using our examples as 
template, remember to include in your work a sensitivity analysis, by checking that your assumptions (both 
likelihood and prior) are reasonable. Some prior sensitivity analysis has been performed in both examples 
to emphasize the importance of showing how much conclusions (the posterior) rely on assumptions (prior). 
For what concerns the likelihood, i.e. misfitting, the key point consists in updating the model fomulation 
until it produces data similar to those in hand, as we have shown in great detail for the second example. 

More in general, we hope that the template modeling shown in these two examples may be useful for any 
analysis confronted with modeling the awkward features of astronomical data, among which heteroscedastic 
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(point-dependent) enws, intrinsic scatter, data structure, non-uniform population (often called Malmquist 
bias) and non-Gaussian data, inclusive of upper/lower limits. 

Acknowledgements: The first part of this chapter is largely based on papers written in collaboration with 
Merrilee Hurn. It is a pleasure to thank Memlee for the fruitful collaboration and her wise suggestions along 
the yeai^s, and for comments on an early draft of this chapter. Errors and inconsistencies remain my own. 
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